1. INTRODUCTION AND SUMMARY

This report to the Advanced Research Projects Agency (ARPA) is
the final one for the period 1 October 1973 through 15 September
1974 on System Development Corporation's (SDC's) research and
development program in Interactive Systems Research (ISR). An
interim report covering the first six months of this program
(through 31 March 1974) has been submitted (Bernstein, 1974).

The present report concentrates on the progress, results, and
problems for the last half of the year's work and contains plans
for the future activities of the ISR program. The program
presently includes three projects: (1) Speech Understanding
Research, (2) Lexical Data Archive, and (3) Common Information
Structures. The overall intent of SDC's ISR program is to
develop basic technology for improved man-machine interactive
systems with application to a variety of anticipated military
needs. The major emphasis at present is on speech understanding
and related problems.

The Speech Understanding Research project contains many
developments that will be material in enhancing and improving
interactive systems by making them more capable and productive
when used by the casual user. The continuing effort to permit
such users to easily and effectively communicate with a
computer-based system in language forms that are natural to them
is of particular importance.

In support of ARPA's Speech Understanding Research program and
other language-based efforts, the Lexical Data Archive project
was started to create (as its name implies) a central archive of
lexical information for, and of particular interest to, all ARPA
contractors working in speech understanding as well as other
language-based activities.

The world of information processing has been one of continuous
evolution and change since its creation. The ever increasing
dependence of users of information processing systems upon data
bases and data management systems has become obvious. With the
continuously changing environment of hardware, operating systems,
and data management systems within which data bases reside, the
need to move a data base with minimal effort, cost, and
disruption to its users is becoming as important as smooth,
well-engineered user interfaces. The Common Information
Structures project is continuing to develop the methodology
necessary to perform data base transfers in the appropriate way.

The following summarizes these projects' activities during the
past year.


1.1 SPEECH UNDERSTANDING RESEARCH

The Speech Understanding Research (SUR) project is continuing its
efforts to create a demonstrable prototype Voice-controlled Data
Management System (VDMS), using free-form spoken English as
input. Although it was originally intended that an enhanced
version of the prototype that was demonstrated to the SUR Group
Review Team in December, 1973, would be demonstrable in the fall
of 1974, the effort has been redirected to produce a version of
VDMS later in 1974 that incorporates the linguistic processor
being developed by the Stanford Research Institute (SRI) in
conjunction with SDC's SUR project, along with improvements that
have been made in SDC's phonological and acoustic-phonetic
processes. The development work is all being done on the SDC
Computer Center Facility's IBM 370/145 VM-370 Operating System
with remote users accessing the system via the ARPA Network.


1.2  LEXICAL  DATA  ARCHIVE 

The Lexical Data Archive (LDA) project was begun in October,
1973, with the objective of providing, via the ARPA Network, a
centrally available collection of lexical information on the
union of lexicons used by the ARPA SUR contractors in their
collective and individual endeavors. The Semantically Oriented
Lexical Archive (SOLAR) has been designed, and most of the
relevant data have been collected. Programs have been developed
for deriving machine-readable data files and for discovering and
displaying links among lexicon words, and a file management
facility has been developed. To date, data have been manually
distributed to the SUR projects and others. Automatic access
will be provided in the near future.


1.3 COMMON INFORMATION STRUCTURES

The COmmon INformation Structures (COINS) project has devised a
method for transferring data bases between disparate data
management systems (DMSs) that requires minimal effort and cost
by utilizing to a maximum degree the functional capabilities of
the two DMSs involved. The method depends upon the existence of
three languages to describe the various phases of the
intermediate process of the actual transfer. They are a Common
Data Description Language, a Common Data Translation Language,
and a Common Data Format Language. These languages have been
defined, and effort is now concentrating on refinement and the
implementation of language processors that will permit a thorough
test and evaluation of the method.


2. SPEECH UNDERSTANDING RESEARCH

2.1 INTRODUCTION

The continuing long-term goal of the SDC Speech Understanding
Research (SUR) project is to develop and implement a data
management system that is controlled and operated by its users
through free-form spoken English. The basic approach taken to
achieve this goal is distinguished by a modular system
architecture that embodies phonological and linguistic processes
and an acoustic-phonetic processor. The system architecture
enables a complete assembly of multidirectional parsing processes
to operate in parallel on the same or different segments of an
input utterance. Two major advantages are obtained from this:
(1) the system may start working on the least ambiguous portions
of the input, and (2) predictions need not be limited to near
neighbors of a recognized input segment but may be applied to any
portion of the entire utterance.

The acoustic-phonetic processor contains the processes that
extract acoustic information from the speech signal and make
acoustic-phonetic labeling decisions. The processor we are
developing reflects the fact that the speech signal is never
wholly unambiguous; any attempt to precisely label phones and
their boundaries must recognize and allow for this ambiguity in
mapping the extremely large number of speech sounds into the
relatively small set of acoustic-phonetic transcription symbols.
Accordingly, in this processor, each acoustic-phonetic segment is
multiply labeled, and each label is assigned a score. Scores are
based on a measure function that is, in turn, based on feature
parameters previously developed for each speaker (user).

As a first step toward the long-term goal, we constructed and
refined a limited Voice-controlled Data Management System (VDMS)
that could accept continuous speech and be demonstrably usable by
at least two speakers. Within this system, there were
limitations with respect to both the size of the vocabulary and
the syntax of the English subset permitted.


2.2 PROGRESS FOR THE FIRST SIX-MONTH PERIOD

At the beginning of the present contract year, two separate
versions of VDMS had been constructed and tested: Version A,
which operated on the SUR laboratory Raytheon 704 minicomputer in
conjunction with an IBM 370/145, and which incorporated the
modular system architecture, and Version B, which operated
entirely on the Raytheon 704, and which embodied the
multiple-labeling philosophy of the acoustic-phonetic processor
described above. Both versions allowed the user to access a data
base of information about the length, beam, draft, armament, and
other characteristics of submarines in the naval fleets of the
United States, the Soviet Union, and the United Kingdom. The
total vocabulary of each system was approximately 150 words. The
query language used to access this information could be described
by about 35 syntax equations for Version A and about 30 syntax
equations for Version B; the difference is accounted for by the
fact that Version A contained report generation capabilities that
Version B did not. Typical queries that could be accommodated by
either system are:

"TOTAL QUANTITY WHERE TYPE EQUALS NUCLEAR AND COUNTRY EQUALS
USA."

"PRINT TYPE WHERE MISSILES GREATER THAN SEVEN."

Both of these initial versions of VDMS had been tested with a
large number of utterances and had achieved reasonably good
results (Bernstein, 1974). The first major task undertaken
during this contract year was the construction of a single
full-scale version of VDMS that combines the best elements of the
two versions. This new version of VDMS (Ritea, 1974a, b) was
completed and successfully demonstrated in late 1973. Its major
characteristics are described in this section.
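
As a rough illustration of the query pattern in the two sample
queries quoted earlier, the following Python sketch splits a query
into a command word, a target item, and a list of conditions. The
function name parse_query, the set of relational operators, and the
output layout are assumptions made for this illustration; the
report's actual syntax equations are not reproduced here.

```python
import re

# Relational operators seen in the sample queries, plus LESS THAN as an
# assumed counterpart; the real VDMS operator set is not given here.
RELOPS = ("EQUALS", "GREATER THAN", "LESS THAN")


def parse_query(text):
    """Split '<command> <item> WHERE <cond> [AND <cond>]...' into parts."""
    words = text.upper().rstrip(". ").split()
    if len(words) < 2:
        raise ValueError("a query needs a command and a target item")
    command, rest = words[0], " ".join(words[1:])
    target, _, where = rest.partition(" WHERE ")
    clauses = where.split(" AND ") if where else []
    conditions = []
    for clause in clauses:
        # Try each operator in turn; the first one that fits wins.
        for op in RELOPS:
            match = re.fullmatch(rf"(.+?) {op} (.+)", clause)
            if match:
                conditions.append((match.group(1), op, match.group(2)))
                break
        else:
            raise ValueError(f"unrecognized condition: {clause!r}")
    return {"command": command, "target": target, "conditions": conditions}
```

Applied to the second sample query, this sketch yields the command
PRINT, the target TYPE, and a single condition on MISSILES.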


2.2.1 System Overview

The overall configuration of VDMS is characterized by three major
processing modules:

(1) The linguistic processor, which contains the parser and
a discourse-level controller;

(2) The acoustic-phonetic processor, whose results are
contained in an array of data called the A-matrix;

(3) The lexical matching procedure, which performs matches
of predicted words at the syllable level, using various
applications of phonological rules to assist in its
matching.

The pattern of communication among these modules is illustrated
in Figure 2-1. The speech from the user is input to the
acoustic-phonetic processor, which forms an array of
acoustic-phonetic data for use by the parser. At the beginning
of the processing of an utterance, the discourse-level controller
provides a variety of predictions and restrictions on what is
allowable or expected in this utterance. The predicted words are
transmitted to the lexical matching procedure, which looks for
[text illegible in source]

Figure 2-1. Overview of VDMS

[text illegible in source]

(1) It allows for logical parallel execution of modules.

(2) It provides for the running of modules on remote
computers.

(3) Using a trace and debugging provision, breakpoints can
be inserted for monitoring the flow of data through the
system.

(4) Changes in the order of execution of the various
modules may be specified.

(5) Data dependencies among modules may be controlled.

A more detailed discussion of CSL is given by Barnett (1974b).


2.2.2 The Discourse-Level Controller

The discourse-level controller comprises two modules: the user
model and the thematic memory. The user model determines what
state the user is in and predicts the kinds of grammar that
may appear in his next interaction with the system. Some sample
states are "system login", "interactive query mode", "report
generation mode", and "user aids". If the user is in interactive
query mode, the user model will predict syntax equations for the
next instruction, such as those for "Print", "Repeat", "Count",
"[illegible]", or "Total", each of which is the first word of an
interactive query statement. The words "Explain" and "Describe"
are the first words of typical user-aid commands. Each
prediction carries with it a confidence level such that the
greater the confidence level, the more liberal the system will be
in overlooking errors in recognition.
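
The relation between prediction confidence and tolerance of
recognition errors can be sketched as follows. This is only an
illustration under stated assumptions: the confidence values, the
two threshold constants, and the function names are hypothetical,
and the report does not say how confidence is actually converted
into an acceptance criterion.

```python
# Hypothetical prediction table: dialogue state -> {first word: confidence}.
# The words come from the text above; the numbers are invented.
PREDICTIONS = {
    "interactive query mode": {
        "PRINT": 0.9, "REPEAT": 0.6, "COUNT": 0.7, "TOTAL": 0.7,
    },
    "user aids": {"EXPLAIN": 0.8, "DESCRIBE": 0.8},
}

BASE_THRESHOLD = 0.8   # match score required for an unpredicted word (assumed)
MAX_RELAXATION = 0.4   # how far full confidence lowers that requirement (assumed)


def required_score(state, word):
    """Higher prediction confidence lowers the match score that is required."""
    confidence = PREDICTIONS.get(state, {}).get(word, 0.0)
    return BASE_THRESHOLD - MAX_RELAXATION * confidence


def accept(state, word, match_score):
    """Accept a word hypothesis if its match score meets the relaxed threshold."""
    return match_score >= required_score(state, word)
```

With these invented numbers, a mediocre acoustic match for "PRINT"
is accepted in interactive query mode, while the same match score
for "EXPLAIN" is rejected there because the user model did not
predict it.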

The thematic memory is concerned with particular content words
that might occur in the next utterance; it is not concerned with
syntactic terminals such as the digits or the word "Print".
Information is kept about each word as it is used, e.g., how
long (how many utterances ago) it has been since the word was
used and how likely it is that the word will reoccur, depending
on how it was used originally. For example, if the user said
"[illegible] category", the assumption is that the next command
will probably involve something about the categories of
submarines in the data base.

In addition to looking for content words in user commands, the
thematic memory also keeps a record of any non-numeric symbolic
responses from the data management system. These responses are
also used to predict words that are highly likely to occur in the
next utterance.

Throughout a dialogue, the various content words as predicted by
the thematic memory are aged from utterance to utterance, and
their likelihood of being used is diminished if they have not
been reused. If the confidence of prediction drops below a
threshold, then the word is removed from the thematic memory and
dropped from consideration until it is used again. Also, if
duplicate entries occur within an utterance, then the age and
original merit are modified to take care of this effect.
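
The aging behavior described above can be sketched in a few lines
of Python. The geometric decay, the default decay and threshold
values, and all class and method names are assumptions for
illustration; the report does not give the actual aging formula,
and this sketch simply collapses duplicate words within an
utterance rather than adjusting their age and merit.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    merit: float      # likelihood assigned when the word was last used
    age: int = 0      # utterances elapsed since that use


class ThematicMemory:
    def __init__(self, decay=0.5, threshold=0.2):
        self.decay = decay            # assumed per-utterance decay factor
        self.threshold = threshold    # assumed removal threshold
        self.entries = {}

    def confidence(self, word):
        """Current confidence: original merit reduced geometrically by age."""
        entry = self.entries[word]
        return entry.merit * self.decay ** entry.age

    def end_of_utterance(self, content_words, merit=1.0):
        """Age every stored word, refresh the words just used, drop stale ones."""
        for entry in self.entries.values():
            entry.age += 1
        for word in set(content_words):
            self.entries[word] = Entry(merit=merit)
        stale = [w for w in self.entries if self.confidence(w) < self.threshold]
        for word in stale:
            del self.entries[word]

    def predictions(self):
        """Stored content words, most confident first."""
        return sorted(self.entries, key=self.confidence, reverse=True)
```

A word that is not reused loses half of its confidence per
utterance in this sketch and disappears once it falls below the
threshold; using it again restores it at full merit.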


2.2.3 The Parser

The basic linguistic unit used for the parsing strategy in VDMS
is the phrase, which consists of one or more vocabulary words (up
to the complete utterance) linked together in a syntactically and
semantically correct order. Some examples of phrases are
"country and category" and "quantity equals five". The parser
attempts to predict phrases using the user model, thematic
patterning, and grammatical and semantic constraints information
provided by the discourse-level controller. Predicted phrases
are matched against the acoustic-phonetic data for acceptance or
rejection. Accepted phrases are then concatenated to form a
larger phrase, which is then analyzed to see whether it is a
complete utterance.

The parser consists of four major modules:

(1) The classifier

(2) The bottom driver

(3) The top driver

(4) The side driver

The classifier's task is to assign a syntactic category to each
word accepted by the lexical matching procedure. Some typical
syntactic categories and examples are:

Item name ("country")

Item value ("USA")

Syntactic terminal ("print")

A syntactic category, as generated by the classifier, is used by
the other modules of the parser to generate predictions about
allowable syntax in other parts of the utterance. The bottom
driver is a typical bottom-up module, which takes found phrases
and determines how they may be used in completing the parsing of
a complete utterance. The top driver takes predicted phrases and
from them derives either a syntactic terminal or a shorter phrase
to be looked for next. The syntactic terminals are sent to the
lexical matching procedure, which then attempts to match each one
against the acoustic-phonetic data. The side driver takes
completed or partially completed phrases from the bottom driver.
If a phrase is incomplete, the side driver determines which part
to look for next, and it will ask the top driver to locate the
missing part. On the other hand, if the phrase has been
completed, the side driver analyzes it to see whether it is a
legal complete utterance. If it is, the side driver terminates
the parsing activities of all modules and transmits the symbolic
form of the hypothesized utterance to the data management system
and the discourse-level controller. Other completed phrases
(which do not cover the entire utterance) are used to bottom
drive the system up one level to create larger, more complete
phrases. The flow of processing within the parser is shown in
Figure 2-2.


Figure 2-2. The Parser


2.2.4 The Lexical Matching Procedure

The lexical matching procedure verifies or rejects a predicted
word through pattern-matching against the available
acoustic-phonetic data. A detailed description of the procedure
is given by Weeks (1974a, b).


Figure 2-3. Lexical Matching Procedure


The syllable is the unit that is used in the lexical matching
process. The linguistic issues concerning the existence or form
of the syllable have been sidestepped by giving it the following
algorithmic definition: a vowel nucleus preceded by a consonant
cluster (possibly null) and followed by another consonant cluster
(also possibly null). All words have syllable divisions marked
in the lexicon, and some of the phonemic rules are written in
terms of these boundaries. Within a syllable, most of the
co-articulation is internal; effects over boundaries are handled
separately. Because a good percentage of phonetic dependencies
occur within these units, rules can conveniently be applied.
Under this approach, a separate set of rules must be set up for
dealing with interactions over boundaries. Word boundaries can
then be considered as special cases of syllable boundaries, so
that the rules apply to inter-word co-articulation.
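
The algorithmic definition of the syllable given above lends
itself to a direct check. The following Python sketch is
illustrative only: it uses ordinary letters as stand-in phoneme
symbols, treats the vowel nucleus as a single symbol, and assumes
a period as the boundary marker in lexicon entries. None of these
conventions is taken from the report.

```python
import re

VOWELS = frozenset("aeiou")  # stand-in vowel inventory (assumption)


def is_syllable(phonemes):
    """True for: consonant cluster (may be empty), one vowel, consonant cluster."""
    kinds = "".join("V" if p in VOWELS else "C" for p in phonemes)
    return re.fullmatch(r"C*VC*", kinds) is not None


def split_marked_entry(entry, boundary="."):
    """Split a lexicon entry at its marked boundaries and validate each part."""
    syllables = [tuple(part) for part in entry.split(boundary)]
    for syllable in syllables:
        if not is_syllable(syllable):
            raise ValueError(f"not a syllable under the definition: {syllable!r}")
    return syllables
```

For instance, the made-up entry "sub.ma.rin" splits into three
units that each satisfy the definition, whereas a unit with no
vowel or with two vowel nuclei is rejected.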

Figure 2-3 is a block diagram of the lexical matching procedure.
When a word is predicted in orthographic form, its phonemic
representation is extracted from the lexicon. A set of phonemic
rules is applied to the phonemic representation to obtain a set
of lexical variants. These variants arise by phonemic
replacements as dictated by the rules. The set of lexical
variants is then sent to the main matching procedure, where each
is matched one by one against the acoustic-phonetic data on a
syllable-by-syllable basis. The boundary analysis is done in
conjunction with the syllable matching and attempts to compensate
for articulation across syllable and word boundaries. The
resulting scores for each lexical variant are then sent to the
main matching procedure, which decides which possibility gives
the best over-all score. The structure of the phonetic rules
pass is described by Barnett (1974a).
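
The generation of lexical variants by replacement rules, followed
by selection of the best-scoring variant, can be sketched as
below. The rule format (a pattern tuple and a replacement tuple),
the function names, and the idea of passing the scoring function
in as a parameter are assumptions; the actual rules and the
syllable-by-syllable scoring against the acoustic-phonetic data
are described by Barnett (1974a) and Weeks (1974a, b) and are not
reproduced here.

```python
def lexical_variants(phonemes, rules):
    """Return the base form plus every form reachable by the optional rules.

    phonemes: sequence of phoneme symbols.
    rules: list of (pattern, replacement) pairs, each a tuple of symbols.
    Rules are applied in order, one matching site at a time, and every
    rule is optional, so the unmodified forms are always kept.
    """
    variants = {tuple(phonemes)}
    for pattern, replacement in rules:
        width = len(pattern)
        produced = set()
        for variant in variants:
            for i in range(len(variant) - width + 1):
                if variant[i:i + width] == pattern:
                    produced.add(variant[:i] + replacement + variant[i + width:])
        variants |= produced
    return variants


def best_variant(variants, score):
    """Pick the variant with the highest score; `score` is supplied by the caller."""
    return max(variants, key=score)
```

In use, `score` would stand in for the match against the A-matrix;
any function that maps a variant to a number can be substituted to
exercise the sketch.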


2.2.5 The Acoustic-Phonetic Processor

Speech is input interactively in an acoustically controlled
environment (a sound booth with a signal-to-noise ratio better
than 50 dB) using a Sony ECM-377 condenser microphone, which has
an essentially flat frequency response to beyond 10,000 Hz.
Low-level preemphasis is employed, shaping the frequency response
with a zero at 300 Hz and a pole at 3,000 Hz. This speech signal
is bandlimited to 9,000 Hz and digitized at 20,000 samples per
second using a 12-bit analog-to-digital converter. The input
speech is saved directly on digital media with no intervening
analog recording steps.

The digitized speech is then passed through a digital-to-analog
converter, and the resulting analog waveform is passed through
three hardware filters having nominal bandpasses of 150 to 900
Hz, 900 to 2,200 Hz, and 2,200 Hz to 5,000 Hz, respectively. For
each 10-msec. interval, two parameters are extracted from each of
the three filter outputs: (1) the maximum peak-to-peak amplitude
and (2) a count of zero crossings. The resulting six parameters
(two from each of the three filtered signals) are used to assign
a rough acoustic label to each 10-msec. segment. Five labels are
currently used: VW (vowel-like), SS (strong frication), SI
(silence), UV (low-amplitude voiced or unvoiced), and VC (all
other; usually weak voicing). The next step in the processing is
to refine these rough labels, imposing a more accurate
classification on each segment.
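
The two per-interval parameters can be illustrated in software,
although the report obtains them from hardware filter outputs.
The Python sketch below computes them for one band-filtered
signal. It assumes that the signal is already available as a list
of samples at 20,000 samples per second, and it takes "maximum
peak-to-peak amplitude" to mean the largest sample minus the
smallest sample within the interval; both points are assumptions,
and the thresholds that turn the six parameters into rough labels
are not given in the text and are not invented here.

```python
def frame_features(samples, sample_rate=20_000, frame_ms=10):
    """Return (peak_to_peak, zero_crossings) for each complete frame.

    samples: band-filtered signal as a sequence of numbers.
    A zero crossing is counted whenever consecutive samples differ
    in sign (negative versus non-negative).
    """
    frame_len = sample_rate * frame_ms // 1000   # 200 samples at 20 kHz
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        peak_to_peak = max(frame) - min(frame)
        zero_crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
        )
        features.append((peak_to_peak, zero_crossings))
    return features
```

Running the function once on each of the three band outputs gives
the six parameters per 10-msec. interval that the text describes.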

Within the classes VW and VC, more specific labels are assigned
using a vowel-recognition strategy based on speaker-dependent
vowel formant information (Kameny and Weeks, 1974). This
information is compared with the formants (obtained with the use
of a Linear Predictive Coefficient (LPC) spectrum) at selected
instants in time for each vowel to be identified. A modified
Euclidean distance function is used to compute the relative
distances between the candidate formant values and pre-stored
speaker-dependent vowel formant values. The closest three vowels
are selected, and associated scores are assigned to these choices
based on the values of the distance function.
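
The selection of the three closest vowels can be sketched as
follows. This illustration uses the plain Euclidean distance,
because the modification used in the report is not specified, and
it converts distance to a score with an arbitrary decreasing
function. The function name, the score formula, and the formant
values in the usage note are assumptions, not data from the
report.

```python
import math


def closest_vowels(candidate, table, k=3):
    """Rank the vowels in a speaker's table by distance to the candidate.

    candidate: measured formant values in Hz, e.g. (F1, F2).
    table: {vowel symbol: stored formant values of the same length}.
    Returns up to k (vowel, score) pairs, nearest first. The score
    1 / (1 + distance) is an arbitrary choice for this sketch.
    """
    distances = {v: math.dist(candidate, stored) for v, stored in table.items()}
    ranked = sorted(distances, key=distances.get)[:k]
    return [(v, 1.0 / (1.0 + distances[v])) for v in ranked]
```

With an invented table such as {"i": (270, 2290), "e": (530,
1840), "a": (730, 1090), "u": (300, 870)}, a candidate measured
at (280, 2250) is ranked /i/ first, then /e/, then /a/.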

Fricatives and plosives are characteristically found within
sequences of segments labeled SS, VC, or UV. For these areas, a
technique called the Low-Coefficient LPC (LCLPC), described by
Molho (1974a, b), has been shown to provide meaningful spectra
that correspond well with both acoustic-phonetic theory and with
the experimental results of others. Time resolution is
sufficiently narrow to allow independent spectral analysis of the
release, frication, and aspiration portions of an unvoiced
plosive or to demonstrate spectral change within a consonant
cluster, so that clusters such as /ks/ and /ts/ may often be
distinguished. For analysis of unvoiced speech, the LCLPC uses
the autocorrelation method with eight coefficients and a 6-msec.
Hanning window. Analysis of spectra obtained in this way allows
the following five classes to be distinguished:


(1) labial or dental (LD)

(2) alveolar (AL)

(3) alveopalatal (AP)

(4) palatal or velar (PV)

(5) voiced or low energy (VS)

These classes correspond roughly to the spectral characteristics
of unvoiced fricatives and plosives. Moreover, there is a
correspondence between these classes and the articulatory
positions of unvoiced fricatives and plosives. The classes LD,
AL, AP, and PV ideally contain the following phonemes:


LD: /p/, /f/, /θ/

AL: /t/, /s/

AP: /ʃ/

PV: /k/

Experimentation has also confirmed that the glide /w/ and the
liquid /l/ characteristically occur within the VW or VC classes.
The present approach to recognizing these phonemes is to augment
a speaker's vowel formant table with the formant frequency values
for /w/ and /l/. These formant values have consistently been
easily distinguishable from the formants of the vowels and have
enabled the system to accurately isolate and recognize /w/ and
/l/. The glide /y/ and the liquid /r/ are handled indirectly,
again with the use of the speaker-dependent vowel formant table:
if a 10-msec. segment has been labeled /i/, it is assumed that
the segment could be a /y/ with equal probability, and both
labels are then assigned to the segment with the same score. If
a segment has been labeled /ɝ/ (again with the aid of the vowel
table), the label /r/ is assigned to the same segment with an
equal score.

Although the system is not yet able to distinguish the various
elements within the class of nasals, viz., /m/, /n/, [three
further symbols illegible in source], a single class name (NA) is
used and has proved quite reliable. A segment is labeled NA
based upon some simple tests involving the amplitudes and
bandwidths of formants F1, F2, and F3.

All of the aforementioned segment labeling procedures are used to
construct an array of acoustic-phonetic data called the A-matrix.
The construction of the A-matrix is shown in Figure 2-4. Each
row of the A-matrix corresponds to a 10-msec. segment of speech
and contains a rough segment label (VW, SS, SI, VC, or UV); one
or more refined segment labels and associated scores based on the
above procedures; formant frequency values; and estimates of
fundamental frequency, RMS energy, and other acoustic-phonetic
parameters used in the assignment of the phoneme and
phoneme-class labels.
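
One way to picture a row of the A-matrix is as a small record.
The Python sketch below is an assumed layout, not the storage
format used on the Raytheon 704: the field names and the helper
methods are hypothetical. The add_alternative method mirrors the
rule stated earlier by which a segment labeled /i/ also receives
the label /y/ with the same score.

```python
from dataclasses import dataclass, field

ROUGH_LABELS = ("VW", "SS", "SI", "VC", "UV")


@dataclass
class AMatrixRow:
    """Data for one 10-msec. segment (field names are illustrative)."""
    rough_label: str
    refined: list = field(default_factory=list)   # (label, score) pairs
    formants_hz: tuple = ()
    f0_hz: float = 0.0
    rms_energy: float = 0.0

    def __post_init__(self):
        if self.rough_label not in ROUGH_LABELS:
            raise ValueError(f"unknown rough label: {self.rough_label!r}")

    def add_label(self, label, score):
        """Attach one refined label with its score."""
        self.refined.append((label, score))

    def add_alternative(self, present, alternative):
        """Give `alternative` the same score as an existing label `present`."""
        for label, score in list(self.refined):
            if label == present:
                self.refined.append((alternative, score))
```

A full A-matrix would then be a list of such rows, one per
10-msec. segment, in time order.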
Figure 2-4. A-Matrix Processing


2.2.6 System Testing


In the present configuration, speech is digitally recorded and
saved on disk, as described above, using the Raytheon 704. The
704 then creates an A-matrix from the digitized waveform. The
A-matrix is then sent (via direct hardware link) to the IBM
370/145, which then performs all subsequent linguistic processing
of the utterance and returns a response to the user. Since the
system is dependent upon thematic patterning for assistance in
understanding an utterance, it is necessary for the user to
interact with VDMS using goal-directed dialogs. For testing
purposes, ten dialogs were created, with an average of ten
utterances per dialog. Each of two male speakers (for whom vowel
formant tables had previously been generated) recited the sets of
dialogs. In this initial test of VDMS, an average of 52% of the
utterances were correctly understood. Analysis of these
preliminary results has shown that this figure can be increased
by implementing some modifications to the phonological processes
and lexical matching procedure.


2.2.7 Related Research

A continuing program of basic acoustic-phonetic research is
providing algorithms that will improve the over-all accuracy of
current and future speech-understanding systems. An experiment
designed to compare the F1 and F2 frequency movements of vowels
next to /r/ with the same vowels before other consonants was
conducted (Kameny, 1974). Lehiste's (1964) data (obtained from
spectrograms) on the vowel allophones associated with /r/ were
used for comparison purposes. The data for this experiment were
based on formant trajectories computed by LPC techniques on the
Raytheon 704. The results of this experiment confirmed Lehiste's
work, which indicates that there is a change in some vowels in a
retroflexed environment. The change in vowels after /r/ is
minimal except for /i/, but the change in vowels before /r/ is
considerable. This was a preliminary experiment in which the
number of subjects and samples was small. However, the results
can be used to develop a retroflexed vowel space on the basis of
a non-retroflexed vowel space and to compare the identification
of vowels using this new F1-F2 space with the identification of
vowels using the non-retroflexed F1-F2 space.

An algorithm that automatically distinguishes the nasals /m/,
/n/, and /ŋ/ from each other was designed (Gillmann, 1974;
Gillmann and Ritea, 1974). Spectral analysis is performed on the
Raytheon 704 using an LPC model to locate the formants of these
phonemes. By comparing the formant frequencies of unknown nasals
to prototype values derived from a normalization utterance, the
algorithm was able to correctly identify nasals in 72% of the
cases tested. Experimentation has indicated that (1) automatic
techniques can be employed to distinguish nasals in continuous
speech; (2) linear prediction can be used effectively to analyze
the spectra of these phonemes; and (3) speaker-dependent tables
of prototype nasal formants extend these results to
multiple-speaker environments.


2.3 PROGRESS FOR THE FINAL SIX-MONTH PERIOD

The original goal to be reached by the end of the current
contract year (as specified in SDC Proposal 73-5674) consisted of
enlarging the vocabulary of the query language and loosening the
grammar to make the language easier and more natural to use.
Specifically, VDMS was to contain a vocabulary of about 500-600
words, and the grammar was to be modified to admit the following
capabilities:

(1) A more natural form of expression for integers (for
example, "thirty four" instead of "three four").

(2) A facility for inter-item comparisons (for example,
"Print category where surface speed greater than
submerged speed.").

(3) Strings of inequalities (for example, "Print type where
draft greater than seven and less than nine.").

(4) Simple arithmetic calculations (for example, "Print
category where surface speed greater than three times
submerged speed.").

(5) Interrogative sentences.
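
The first capability, accepting "thirty four" as well as "three
four", can be illustrated with a small Python sketch. The word
tables, the function name, and the restriction of the natural
form to values below one hundred are assumptions made for this
example and do not describe the grammar that was planned for
VDMS.

```python
UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine".split())}
TEENS = {w: i + 10 for i, w in enumerate(
    "ten eleven twelve thirteen fourteen fifteen sixteen seventeen "
    "eighteen nineteen".split())}
TENS = {w: (i + 2) * 10 for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split())}


def spoken_integer(phrase):
    """Convert a spoken integer to its value.

    Accepts the digit-by-digit form ("three four") and the natural
    form for 0 through 99 ("thirty four", "nineteen", "seven").
    """
    tokens = phrase.lower().replace("-", " ").split()
    if not tokens:
        raise ValueError("empty phrase")
    if all(t in UNITS for t in tokens):
        # Digit-by-digit form: each word contributes one decimal digit.
        value = 0
        for t in tokens:
            value = value * 10 + UNITS[t]
        return value
    if len(tokens) == 1 and tokens[0] in TEENS:
        return TEENS[tokens[0]]
    if tokens[0] in TENS:
        if len(tokens) == 1:
            return TENS[tokens[0]]
        if len(tokens) == 2 and UNITS.get(tokens[1], 0) > 0:
            return TENS[tokens[0]] + UNITS[tokens[1]]
    raise ValueError(f"not a recognized integer phrase: {phrase!r}")
```

Both spoken forms of the example in item (1) map to the same
value, so a query such as "draft greater than thirty four" could
be handed to the data management system unchanged.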

However, recent discussions and negotiations (at the request of
ARPA) with the Stanford Research Institute (SRI) have yielded a
cooperative research plan for further development of a joint
speech understanding system. Within this plan, SDC is
concentrating primarily on signal processing, acoustics,
phonetics, phonology, and system software and hardware support.
SRI is concentrating on syntax, semantics, pragmatics, and
discourse analysis. System design and architecture, and
prosodics, are shared concerns.

The first major goal to be achieved under this cooperative
research plan will be the successful development, implementation,
and demonstration of a speech understanding system in late 1974.


2.3.1 Demonstration System Overview

For the initial implementation of the 1974 demonstration system,
which operates on the Raytheon 704 and IBM 370/145 computers, the
task domain is data management, and the data base consists of
information on the submarine fleets of the United States, the
United Kingdom, and the Soviet Union. The acoustic-phonetic
processing and lexical mapping routines are essentially as used
in VDMS, modified to handle a vocabulary of 300 words. There is
a word-string mapping procedure that handles coarticulation
between pairs of words. The parser is a major revision and
extension of the previous SRI parser (Paxton, 1974), in which
sources of knowledge are separated from the procedures for
applying them. A best-first strategy still prevails, but it is
now possible to start from any fixed point in the utterance, to
skip over portions, and to accept input from word-spotting
routines. The grammar encompasses that of the previous SRI
system but has been extended to cover isolated noun phrases and
nominals. In addition to being independent of the parser, it has
been rewritten as a series of context-free rules with factors
that specify restrictions or conditions on rule application; as a
result, it can be used top-down, bottom-up, or with missing
segments. The semantics have been completely revised from the
previous SRI system. Now, information is stored in a network
representation; corresponding to each syntactic rule there is a
semantic interpretation rule that operates on the network. A
pragmatic component, based on an analysis of protocol studies,
has been added to handle anaphora and ellipsis and to provide
discourse constraints for processing dialog dependencies.
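
The notion of context-free rules that carry factors restricting their
application can be sketched as follows. This is only an illustrative
Python sketch, not the SRI grammar formalism: the Rule class, the
feature dictionaries, and the number-agreement factor are all
assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# A constituent is a (category, features) pair; both are hypothetical.
Constituent = Tuple[str, Dict[str, str]]


@dataclass
class Rule:
    lhs: str
    rhs: List[str]
    # Factor: returns True when the rule may apply to these constituents.
    factor: Callable[[List[Constituent]], bool] = lambda parts: True

    def apply(self, parts: List[Constituent]) -> Optional[Constituent]:
        """Bottom-up application: build an lhs constituent if allowed."""
        if [cat for cat, _ in parts] != self.rhs:
            return None          # categories do not match the rule body
        if not self.factor(parts):
            return None          # a restriction on application failed
        # The new constituent inherits the features of its last part.
        return (self.lhs, dict(parts[-1][1]))


def number_agrees(parts):
    """Example factor: determiner and noun must agree in number."""
    return parts[0][1].get("num") == parts[1][1].get("num")


NP = Rule("NP", ["DET", "N"], number_agrees)
```

Because the rule data are separate from the procedure that applies
them, the same Rule objects could in principle be consulted by a
top-down or a bottom-up procedure; only bottom-up application is shown.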

Although the acoustic-phonetic processor and the system hardware
and software will undergo little or no change for this system, a
fair amount of research has been accomplished in these areas.


2.3.2 Acoustic-Phonetic Research

For purposes of acoustic feature extraction, two programs were
developed: one for automatic formant frequency analysis and
another for fundamental frequency extraction.

The formant frequency analysis program assigns frequency,
amplitude, and bandwidth to each of the first three formants for
each 10-msec. voiced segment of continuous speech. Its input
parameters are fundamental frequency, RMS, and up to five
spectral peaks below 5,000 Hz. with their respective amplitudes
and bandwidths. Peak information is obtained from linear
prediction spectra. Techniques distinguishing this formant
tracker from previously reported ones are that: (1) all spectrum
computations are accomplished in the peak-picking phase, prior to
formant tracking (this may yield spurious formants but few
missing formants); (2) anchor points are located by selecting
three consecutive 10-msec. segments in which three possible
formant frequencies do not differ from one segment to the next by
more than a threshold amount; and (3) decision-making is aided by
frequency pattern matching when more than one formant is possible
for a given slot (this is particularly useful for nasals and /l/
and /w/). Frequency-pattern information is derived from speaker


15 November 1974


System Development Corporation

TM-5243/002/00


vowel-sonorant  frequency  tables. 
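
The anchor-point test in item (2) above can be illustrated with a
short sketch. It assumes each 10-msec. segment is represented by three
candidate formant frequencies in Hz; the function name
find_anchor_points and the 100-Hz default threshold are illustrative
assumptions, not values taken from the tracker itself.

```python
def find_anchor_points(frames, threshold_hz=100.0):
    """Return start indices of three consecutive segments whose three
    candidate formant frequencies each change by at most threshold_hz
    from one segment to the next."""
    def stable(a, b):
        # Compare the candidates slot by slot (F1 with F1, F2 with F2, ...).
        return all(abs(x - y) <= threshold_hz for x, y in zip(a, b))

    anchors = []
    for i in range(len(frames) - 2):
        if stable(frames[i], frames[i + 1]) and stable(frames[i + 1], frames[i + 2]):
            anchors.append(i)
    return anchors
```

A tracker built this way would start from such stable stretches and
work outward into segments where the formant assignment is less
certain.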

Fundamental frequency is extracted by a three-stage process. The
speech is first digitally low-pass filtered and down-sampled from
20,000 samples per second to 2,000 per second. Autocorrelation
spectra are then taken every 10 msec. using what Skinner (1973)
describes as the "end-off" multiplication technique. Finally, a
pitch-tracking pass extracts peaks from these spectra, refines
them by parabolic curve fitting, and assembles these values into
a coherent pitch track by editing out octave errors (mistaking a
harmonic or subharmonic for the fundamental frequency) and
isolated anomalies.
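
The last two stages can be sketched in a few lines of Python. The
sketch below is a simplified stand-in, not the program described
above: it assumes the input frame has already been low-pass filtered
and down-sampled to 2,000 samples per second, it uses a plain
autocorrelation rather than the "end-off" multiplication technique,
and the function names, search range, and tolerance are assumptions.

```python
def parabolic_peak(y_prev, y_peak, y_next):
    """Offset (in samples) of the vertex of the parabola through three
    equally spaced points around a local maximum."""
    denom = y_prev - 2.0 * y_peak + y_next
    return 0.0 if denom == 0 else 0.5 * (y_prev - y_next) / denom


def estimate_f0(samples, rate=2000, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency (Hz) of one frame.
    The frame should be longer than rate / fmax samples."""
    n = len(samples)
    lo, hi = int(rate / fmax), min(int(rate / fmin), n - 2)
    # Plain autocorrelation for lags 0 .. hi + 1.
    ac = [sum(samples[i] * samples[i + lag] for i in range(n - lag))
          for lag in range(hi + 2)]
    # Strongest peak inside the allowed pitch-period range.
    best = max(range(max(lo, 1), hi + 1), key=lambda lag: ac[lag])
    # Refine the integer lag by parabolic curve fitting.
    lag = best + parabolic_peak(ac[best - 1], ac[best], ac[best + 1])
    return rate / lag


def edit_octave_errors(track, tol=0.15):
    """Fold values near twice or half the median of a non-empty track
    back onto the fundamental."""
    med = sorted(track)[len(track) // 2]
    out = []
    for f in track:
        if abs(f / med - 2.0) < tol * 2.0:      # harmonic mistaken for F0
            f = f / 2.0
        elif abs(f / med - 0.5) < tol * 0.5:    # subharmonic mistaken for F0
            f = f * 2.0
        out.append(f)
    return out
```

Using the track median as the reference is a design choice made for
brevity; a production tracker would more likely compare each value
with its local neighbors.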

In addition to acoustic feature extraction, research was
conducted on the acoustic correlates of style of speech. A major
problem in speech research has been the selection of test
material for both acoustic-phonetic experimentation and speech
understanding system exercising. Some experimenters favor the
use of read speech; others favor the use of spontaneously spoken
speech. However, nothing had been done to determine the acoustic
effects of these different styles of speech or to determine
whether there were actual differences. Our goal was to construct
a carefully designed experiment in which both read and
spontaneous (as well as other speech styles) could be compared.

To this end, recordings were made of ten speakers of "California
English" producing the set of test words "bee, bow, boy, bed,
bad, bud" in seven different styles of speech, as follows:

(1) Free speech during an interview in which subjects were
induced to say the test words without their having been
previously spoken by the experimenter;

(2) Spontaneously spoken lists of words;

(3) Spontaneously produced sentences;

(4) Repetition of sentences spoken by the experimenter;

(5) Reading a continuous passage;

(6) Reading the test words in the sentence "The word is
... ."; and

(7) Reading the test words in lists.

Each word was made to occur in phrase-final, stressed position.
The data were analyzed using the formant frequency analysis
program to determine the frequency, amplitude, and intensity of
the first four formants. The nucleus of each vowel in each word
in each style was determined by an algorithm. The nucleus of the
vowel differed in some styles of speech. Preliminary conclusions
indicate that differences in the formant structure of vowels in
read and spontaneous speech do exist.


2.3.3 System Hardware

Acoustic feature extraction and phonetic processing for future
SDC/SRI speech understanding systems will be done by linked
PDP-11/40 and SPS-41 computers. (Experimentation to determine
what processing is required will continue to be done on the
Raytheon 704.) The PDP-11/40 was delivered in June. We are
currently awaiting a completed version of the ELF operating
system for the PDP-11/40, which is being prepared by SCRL. Once
this is received, we will implement a number of user-level
programs, which are now in a design stage. Delivery of the
SPS-41 is currently scheduled for late November.

In late 1973, an ARPANET interface for the PDP-11 was developed
at SDC. Designated the HSI-IIA, this interface has been
operational at SDC since January, 1974. In March, 1974, the
designer (Lee Molho) was asked by the ARPA Interface Committee
(ISC) to submit HSI-IIA for possible selection as a standard
ARPANET interface for PDP-11 computers. In May, the HSI-IIA
design was selected. For several months thereafter, ISC members
and the SDC staff conducted technical discussions, primarily by
ARPANET "Network Mail". The purpose of these discussions was to
specify an HSI-IIB design suitable for production by some
organization for widespread, general use on the ARPANET. These
discussions resulted in 11 engineering changes to the original
HSI-IIA design in order to meet ISC requirements.

In addition to system hardware and software efforts on the
PDP-11/SPS-41 computer configuration, system support work for the
SUR effort is being done in the form of the development of a
programming language and system that is specifically designed for
the implementation of SUR systems. This language, called CRISP,
will be operational in July, 1975, on the IBM 370/145 under
VM/370. A first draft of the language and system design document
will be completed in December, 1974. CRISP offers a set of
capabilities that, taken together, make it a uniquely appropriate
tool for the implementation of our speech understanding systems.
Among these capabilities are:

• A structured data capability similar to that available in
PL/I.

• Flexible pointer manipulation similar to that in LISP,
including functionals.

• Multi-processing and "spaghetti-stack" primitives.

• Efficient compilation of both arithmetic and
pointer-manipulation algorithms (incremental and batch modes).

• Three levels of extensible languages available to the
user:

(1) Source Language (SL) -- an ALGOL-like language with
infix operators.

(2) Internal Language (IL) -- a LISP-like Polish-prefix
list structure language.

(3) Assembly Language (CAP) -- a macro-assembly language.

• Availability of dynamic, local, and own variables.

• Name pooling.

• System aids to better utilize virtual memory resources.

• A variety of aids for group construction of large
programs.

Also being prepared is a translator program that converts SDC
Infix LISP to CRISP/SL.

Using CRISP, the bottom-end numerical algorithms, mapping
procedures, and top-end component may all be combined in a single
language and system without loss of efficiency. This is
advantageous for several reasons, the most important being the
increased ability of the modules to coordinate and communicate
with one another.


2.4 PLANS

The objective for the 1974-1975 contract year is the successful
operation of a milestone system, developed jointly by SDC and
SRI, capable of handling utterances in ordinary English
appropriate for task-oriented dialogues of the data management
task. This milestone system will have the following
characteristics:

• A vocabulary  of  approximately  600  words. 

• Ability to accommodate six speakers (male and female).

• Response to the user in about 25 times real time.

• Operating-environment signal/noise ratio of approximately
30-40 dB.

• Accuracy in understanding at the utterance level of 90%.

For this joint system, SDC will assume primary responsibility
for:

• Signal processing algorithms, which will perform acoustic
feature extraction from the waveform on the PDP-11/40 and
SPS-41 computers.

• Acoustic-phonetic analysis, which will include
investigations of voiced fricatives and plosives,
segmentation techniques, and development of procedures
for assigning phonetic features to vowels, nasals, and
sonorants.

• Lexical matching procedures, which will consist of the
development of bottom-driving techniques, lexical
subsetting methods to quickly prune unlikely candidates
from lists of proposed words, and prosodic mapping
techniques to map phrases using prosodic information.

• System hardware and software, including interconnection
of the PDP-11/40 and SPS-41 computers, associated
software development, and the creation of formal test and
evaluation procedures for modules of the speech
understanding system.

SDC and SRI will share responsibility for the system architecture
and the analysis of prosodic intonation. SRI will have primary
responsibility for:

• Protocols and discourse analysis.

• Parser development.

• Grammar and semantics.

The activities for which SDC has primary responsibility are
described in detail in SDC Proposal 74-5490.

The objective for the 1975-1976 contract year is the successful
demonstration of the Five-Year System (Newell, et al., 1971).
For this system, a second task domain will be added. Each domain
will have a vocabulary of 1,000 words. The system will
accommodate about 30 speakers and will run on the PDP-11/40,
SPS-41, and IBM 370/145 computers.


3. LEXICAL DATA ARCHIVE

3.1  INTRODUCTION 

The Lexical Data Archive (LDA) project has addressed itself to
the task of providing the ARPA Speech Understanding Research
(SUR) projects with semantic and syntactic data for the words in
their lexicons. Being devoted exclusively to lexical research,
the LDA project can assure a broad range of services for each SUR
project without the triplication of effort and resources that
would be required if the three SUR projects were to collect and
analyze these data themselves. The project is monitoring a broad
range of lexical data sources, selecting the data having
potential payoff for speech understanding, formatting those data
for archival purposes, and providing for their dissemination to
the appropriate SUR projects. The data in the archive are
centered on the 3,000 or so words appearing in the twelve
lexicons currently being used by the SUR projects at Bolt Beranek
and Newman Inc., Carnegie-Mellon University, and System
Development Corporation. Although there is considerable overlap
among the lexicons, the words treated come from quite disparate
domains: chess playing, analyses of moon rocks, submarine fleet
data, project management, and the daily news releases of the
Associated Press.


3.2  PROGRESS  AND  PRESENT  STATUS 

Since the initiation of the LDA project in October, 1973, the
following six tasks have been pursued:

(1) Completion of the design of the Semantically Oriented
Lexical Archive (SOLAR).

(2) Collection of data from the linguistics and
philosophical literature.

(3) Development of computer programs for deriving
machine-readable data files.

(4) Development of computer programs for discovering and
displaying links among words in particular lexicons.

(5) Development of a file-management facility.

(6) Distribution of data to the SUR projects.

These tasks are described in the following sections.


3.2.1 Completion of SOLAR Design

First, the design of the Semantically-Oriented Lexical Archive
(SOLAR) has been completed. This includes the decision as to the
types of lexical data to be collected, the determination of the
data collection procedures, the specification of programs needed
to extract data from machine-readable transcripts, and the design
of the logical structure of the files to be built. In accordance
with the responses from the SUR groups to a questionnaire,
initially distributed in June, 1973, SOLAR will consist of ten
files (seven of which have been implemented wholly or in part):

(1) A word index, which allows a user to easily determine
the words for which data are being collected and the
types of data currently available for a given word.

(2) A bibliographic reference file, which can be used as a
resource for accessing the literature and which can
also be used in conjunction with other files to
abbreviate references within SOLAR.

(3) A file of semantic analyses, which contains formal
treatments of the semantic properties of individual
words as found in the literature. This file is being
built manually, by locating and reading documents
relevant to the SUR lexicon words, extracting the
essence of each document's analysis, writing a critique
of it, and entering this information on sheets for
keypunching. Although each analysis of a particular
word is treated separately, the analyses are tied
together by cross-referencing in the critiques appended
to the analyses.

(4) A file summarizing the theoretical backgrounds from
which the semantic analyses have been extracted.

(5) A file explaining and commenting on the descriptive
constants employed in the semantic analyses.

(6) A file of integrated summaries of analyses given in the
literatures of philosophy and artificial intelligence
for concepts invoked by the descriptive constants.

(7) A file of collocational information found in the
definitions of Webster's Seventh New Collegiate
Dictionary (W7), which has been machine-extracted and
is accessible via the words to which it pertains.

(8) A file of definitional links between words within a
particular SUR lexicon. These are being constructed so
that a user can observe the semantic interrelations in
his lexicon.

(9) A file of semantic fields, which are being designed for
each SUR word by tying to it words found in certain
definitional, synonymitive, and antonymitive
relationships in W7, Webster's New Dictionary of
Synonyms (WNDS), and Roget's International Thesaurus
(Roget).

(10) Finally, every context of each SUR word as found in the
W7 definitions, in the Brown Corpus, and in selected
speech dialogs is being entered in a keyword-in-context
(KWIC) file.

For a more detailed discussion of the contents of each of these
files, see Diller and Olney (1973).


3.2.2 Data Collection

A significant amount of data has been collected for the
hand-built files (files 1-6, above). The approximately 3,000 SUR
words existing as of September, 1974, have been entered into the
word index together with their W7 parts of speech and an
indication of the SOLAR data available for each. About 2,700
references to documents in the linguistics and philosophical
literature have been collected by LDA personnel and entered into
the bibliographic file. Slightly more than 1,600 other citations
to articles in experimental phonetics, psychoacoustics, speech
analysis and synthesis, and phonology have been entered through a
data exchange with Bell Laboratories. Another 1,000 entries in
psycholinguistics and phonology have been entered through an
arrangement with the UCLA Department of Linguistics. Indexing by
author, title, and keyword (among other parameters) is possible
for each bibliographic entry. Approximately 300 semantic
analyses have been written, and about 150 have been converted
into machine-readable data sets to facilitate updating and
distribution. (Section 3.2.2, titled "Sample Semantic Analyses",
of the interim report (Bernstein, 1974) presents four such
semantic analyses.) Explanations of about 200 descriptive
constants used in the semantic analyses have been written, and
about 100 have been computerized. (In Section 3.2.3 of the
interim report, some sample explanatory notes were given.) The
file containing integrative summaries of conceptual analyses has
received considerable attention in recent months. Approximately
20 summaries have been written; half of them have been
computerized and are now being translated into a formal-logic
language to facilitate their incorporation into the knowledge
structures of the SUR systems. (In Section 3.2.4 of the interim
report, three sample conceptual analyses were presented.)


3.2.3 Machine Derivation of Data

Several programs have been developed to allow the creation of two
of the four machine-derived data sets. First, a set of programs
was written that restructured the W7 parsed transcripts into a
format suitable as input to other programs. This set includes
one program that reassembles the definitions from their parsed
format and another that extracts from all of W7 just that subset
of definitions relevant to the current list of SUR words.

Second, a program for building the collocational feature file was
written, compiled, debugged, and run. The resulting file
contains about 10,000 lines of text containing W7 definitions
that show permissible contextual features for particular senses
of the SUR words. Third, the program that converted the Brown
Corpus KWIC data set to SOLAR format was written and run. Since
we limited the number of sample contexts per word to a maximum of
450, the resulting file contains about 300,000 lines. The
addition of the W7 definitional contexts is awaiting an
evaluation of the utility of the Brown contexts. We are
currently adding contexts from dialogs collected by the SUR
groups.
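
A keyword-in-context file with a cap on the number of contexts per
word can be sketched as follows. This is a minimal Python
illustration, not the SOLAR conversion program: the function name
kwic, the context width, and the punctuation handling are assumptions;
only the limit of 450 contexts per word comes from the text above.

```python
def kwic(texts, keywords, width=30, max_contexts=450):
    """Return {keyword: [context lines]}, each line showing the keyword
    with up to `width` characters of context on either side."""
    wanted = {k.lower() for k in keywords}
    index = {k: [] for k in wanted}
    for text in texts:
        words = text.split()
        for i, w in enumerate(words):
            key = w.strip('.,;:!?"()').lower()
            # Skip words outside the lexicon and words already at the cap.
            if key in wanted and len(index[key]) < max_contexts:
                left = " ".join(words[:i])[-width:]
                right = " ".join(words[i + 1:])[:width]
                # Right-justify the left context so keywords line up.
                index[key].append(f"{left:>{width}} {w} {right}")
    return index
```

With roughly 3,000 lexicon words, a cap of this kind bounds the file
size regardless of how frequent any single word is in the corpus.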


3.2.4 Programs Under Development

Considerable progress was made in creating the programs and data
sets needed to build the file displaying definitional links
between words in particular lexicons. The data sets being built
comprise the words particular to a given lexicon, the syntactic
parts of speech for each, the W7 definitions for each, words
standing in an inflectional relationship to the core lexicon, and
a list of stop words for which no definitional links are
followed.
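
The way such data sets could combine into definitional links is
sketched below. The sketch is an assumption-laden illustration in
Python, not the LDA program: the function name definitional_links and
the toy definitions are hypothetical, and inflectional variants and
parts of speech are ignored for brevity.

```python
def definitional_links(definitions, stop_words=()):
    """Map each lexicon word to the other lexicon words that occur in
    its definition, skipping stop words."""
    stops = {s.lower() for s in stop_words}
    lexicon = {w.lower() for w in definitions}
    links = {}
    for word, text in definitions.items():
        tokens = {t.strip('.,;:()"').lower() for t in text.split()}
        links[word.lower()] = sorted(
            t for t in tokens
            if t in lexicon and t not in stops and t != word.lower())
    return links
```

For example, given toy definitions of "submarine", "ship", "water",
and "travel", with "water" listed as a stop word, "submarine" would be
linked to "ship" and "travel" but not to "water".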

Work on the file of semantic fields has centered mainly on the
definition of a data structure and the collection of relevant
data sets. Keypunching of the antonym relations found in WNDS
began in October, 1974. The quasi-synonymitive relations found
in Roget must await the release of the Roget transcript by the
Sedelow group at the University of Kansas. Prof. Sedelow expects
to complete the editing of the transcript near the end of 1974.


3.2.5 Data Management

The six files that are currently computerized are accessible via
COBS, an SDC data management system with exceptional update and
report generation capabilities. This system has greatly
facilitated the creation of the hand-derived files. However,
since the system will not be available after October 31, 1974,
considerable effort was spent late in contract year 1974 in
preparing to move all SOLAR data to another SDC data management
system that is accessible via the ARPANET. The logical structure
of all SOLAR files was reevaluated and revised where necessary,
and the first of eight programs needed to convert the data sets
to the revised format was coded and run.


3.2.6 Data Distribution

Early this year, the archive was publicized throughout the United
States, Canada, Australia, and Europe. Approximately 20
researchers responded to our solicitation for documents dealing
with lexical semantics, and about 35 expressed interest in
receiving data from the archive.

In April, 1974, initial distribution of author and keyword
indices to the bibliographic citation file was made to the SUR
projects and to five university linguistics departments. In
October, 1974, a revised listing of the citation file was
distributed, together with initial listings of the word index,
the semantic analyses, the descriptive constants, the conceptual
analyses, the collocational features, and portions of the KWIC
file.


3.3 PLANS

During the next (1974-75) contract year, the LDA staff will focus
on five tasks. First, we will continue data collection from the
literature. This will involve extending the bibliographic files,
more than doubling the semantic and conceptual analysis files,
and updating the word index. The updating activity derives from
the continual addition of words to the SUR lexicons.

Second, we will continue program development. Some restructuring
of files and programs is expected as a result of feedback from
users regarding the utility of each file. We will also be coding
and running the programs needed to produce the two remaining
machine-derived files (i.e., the file linking words
definitionally and the file of semantic fields). Included in
this effort will be the keypunching of antonymitive relations
found in WNDS. These will be added to the semantic field file to
permit their incorporation into the semantic networks of the SUR
systems.

Third, we will complete the work necessary to put SOLAR on the
ARPA Network and distribute user's guides indicating on-line
accessing procedures. Fourth, we will continue to disseminate
data from each of the files. Lastly, to facilitate use of the
archive, we will continue to document the archive (producing
further user's guides) and will provide demonstrations of the use
of SOLAR.

In the succeeding (1975-76) contract year, the LDA staff will
focus on the following tasks:

(1) Refine the data management output to tailor it more
directly to the specific SUR projects requesting data
(i.e., improve the selectiveness of data
dissemination).

(2) Revise the semantic field file displays (and perhaps
the file structure) in accordance with suggestions from
users as to how the utility of the file could be
enhanced.

(3) Update each of the files on the basis of the new words
added to the SUR lexicons.

(4) Continue hand collection of data for the bibliographic
reference, semantic field, and conceptual analysis
files.

(5) Explore the limits of algorithmically producing a
semantic network for a given lexicon from the data
residing in SOLAR.


3.4  STAFF 

Dr. Timothy C. Diller, Project Leader

John Olney (Consultant)

Thomas Bye (part-time)

Frank Heath (part-time)

Martin Mould (part-time)

Nathan Ucuzoglu (part-time)


4. COMMON INFORMATION STRUCTURES

4.1 INTRODUCTION

The Common Information Structures project is addressing the
problem of converting and transferring data bases among disparate
data management systems (DMSs). The needs for sharing data for
different applications and for transferring existing data into
new computer systems make it apparent that general techniques for
data base conversion are desirable. It is equally desirable that
these techniques be relatively easy to implement and use. The
goal of this project is to develop techniques for data base
conversion that are practical for application to current data
management systems and that are designed to be used easily by
data base users.

The difficulties in converting a data base from one data
management system (DMS) to another arise from the fact that data
base structures are system and application dependent. As a
result, a DMS imposes constraints on the form of the data bases
residing in it. These constraints are of three types: (1)
logical-level constraints, such as level of hierarchies, size and
number of fields, and data types; (2) storage-level constraints,
such as inversion and access paths of files and file indexing
organizations; and (3) physical-level constraints, such as
physical devices used and block/record structures. These levels
were described in reports issued for the 1972-73 contract year.

The conventional method of converting data bases for new
applications is to write a special-purpose conversion program for
each data base. Another possible approach is to define data
description languages for all three levels (logical, storage, and
physical), then specify in these languages the source and target
data bases, as well as conversion statements between them. This
approach has been discussed by several researchers. (See Smith,
1971; Ramirez, 1973; Stored Data Definition and Translation Task
Group, 1972; Smith, 1972; Fry, et al., 1972.) Since this approach
involves all three levels, it requires complex and detailed data
description languages, which are difficult to learn and to use.
It also requires that data be converted from the source physical
environment to the corresponding target physical environment,
which further complicates any possible implementation.

The approach being taken by this project is based on the
assumption that the data conversion process can depend mainly on
conversion at the logical level. Conversion at this level can be
achieved by using existing query and generate capabilities of
DMSs to move data from their physical representation to the
logical level and vice versa. The tasks required in the data
conversion process are diagrammed in Figure 4-1. First, the
source data base is retrieved, via the query capabilities of the
source DMS, and reformatted into a standard form. Then, the data
translation process takes place, and a target data base in the
standard form is produced and reformatted into a data format
acceptable to the generate capability of the target DMS.
Finally, the target data base is generated with the generate
capability of the target DMS.


Figure  4-1.  The  Data  Conversion  Process 
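
The sequence of tasks in Figure 4-1 can be sketched as a small
pipeline. The Python sketch below is illustrative only: the function
names, the record layout, and the comma-separated target format are
assumptions standing in for the query and generate capabilities of
real DMSs, which are not specified here.

```python
def convert_data_base(source_rows, query, translate, generate):
    """Logical-level conversion: query -> standard form -> translate -> generate."""
    standard = [dict(r) for r in query(source_rows)]   # source data in standard form
    translated = [translate(rec) for rec in standard]  # data translation process
    return generate(translated)                        # format for the target DMS


def query_source(rows):
    """Stand-in for the query capability of a source DMS."""
    for name, speed in rows:
        yield {"NAME": name, "SPEED": speed}


def translate(record):
    """Rename fields and convert types to fit the target description."""
    return {"ship_name": record["NAME"].title(),
            "speed_knots": int(record["SPEED"])}


def generate_target(records):
    """Stand-in for the generate capability of a target DMS: one line per record."""
    return [f"{r['ship_name']},{r['speed_knots']}" for r in records]
```

Only the translate step depends on both data descriptions; the query
and generate steps are supplied by the source and target systems,
which is what keeps the conversion at the logical level.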


4.2 PROGRESS AND PRESENT STATUS

The first task of this project during the present contract year
was to explore the possibilities of automatically generating
target data descriptions from source data descriptions. We
studied the file definition languages (FDLs) of several DMSs in
an attempt to isolate the restrictions these systems impose on
data structures. After thoroughly analyzing these FDLs, we
concluded that automatic generation of a target description from
the source description is either:


15 November 1974
32 System Development Corporation
TM-5293/002/00


(a) Trivial -- when the restrictions on the complexity of
data base structures of the target system are less
constraining than the restrictions of the source system
(for example, going from a system that allows one level
of hierarchy into a system that allows nine levels of
hierarchy); or

(b) Impossible to predict in the general case, because the
target description is semantically dependent on the
intended use of the data base and on considerations of
time, space, and cost.

Consequently, we concluded that it would not be fruitful to
continue in this direction, and we decided to direct the main
effort of the project at the development of processes that
actually convert data, rather than at the automatic generation of
target data descriptions. In the context of developing the
details of the data conversion process, we performed the several
tasks described in the following paragraphs.

A detailed study was made of the different data description
languages and data conversion approaches described in the
literature. (See Sibley and Taylor, 1973; Taylor, 1971; Smith,
1971; CODASYL Systems Committee, 1971.) We concluded that because
these approaches advocate the use of details of all three levels
of data description, they are too difficult to implement and use.
This led us to the current approach, which depends mainly on the
logical level of data.

A methodology for the data conversion process was developed. It
involves the development of source and target reformatters and a
logical data translator. These processes are driven by
statements written in three languages, which are dependent mainly
on logical characteristics of the data to be converted. These
languages are the Common Data Description Language (CDDL), the
Common Data Translation Language (CDTL), and the Common Data
Format Language (CDFL).

A preliminary version of CDDL was developed. This was
accomplished by analyzing existing DDLs (especially those of
CODASYL and Smith) and selecting the subset that is relevant to
our approach. The syntax of CDDL represents a logical view of
hierarchical data structures. Because it contains no
storage-level or physical-level requirements, it is a simple
language for a user to specify. A file statement consists of
group and field statements. A group statement, in turn, consists
of field statements and, possibly, additional group statements,
thus establishing a hierarchical structure. There are several
types for fields, and a repetition number for groups puts an
upper limit on the number of values in a group instance. The
statements in CDDL, together with statements in CDTL, supply all
the information necessary for the data conversion process. As a
consequence, CDDL might change according to the requirements of
CDTL.
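
The file, group, and field structure just described can be
modeled with a few Python classes. This is a sketch under stated
assumptions, not CDDL itself: the concrete syntax of CDDL is not
given in this report, so the class names, the type names "CHAR"
and "INT", and the reading of the repetition number as a cap on
the occurrences of a group under its parent are all ours.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    name: str
    type: str                 # e.g. "CHAR" or "INT"; type names are assumptions


@dataclass
class Group:
    name: str
    repeat: int               # upper limit on occurrences (our reading of the text)
    fields: List[Field] = field(default_factory=list)
    groups: List["Group"] = field(default_factory=list)


@dataclass
class File:
    name: str
    fields: List[Field] = field(default_factory=list)
    groups: List[Group] = field(default_factory=list)


def depth(node) -> int:
    """Number of hierarchy levels nested below a file or group."""
    return max((1 + depth(g) for g in node.groups), default=0)


def within_limits(node, instance: dict) -> bool:
    """Check that no group occurs more often than its repetition number."""
    for g in node.groups:
        occurrences = instance.get(g.name, [])
        if len(occurrences) > g.repeat:
            return False
        if not all(within_limits(g, occ) for occ in occurrences):
            return False
    return True


# A department/employee description with at most two employees per department.
emp = Group("EMP", repeat=2, fields=[Field("ENAME", "CHAR"), Field("SALARY", "INT")])
dept = Group("DEPT", repeat=10, fields=[Field("DNAME", "CHAR")], groups=[emp])
personnel = File("PERSONNEL", groups=[dept])

ok = {"DEPT": [{"DNAME": "SALES", "EMP": [{"ENAME": "KING", "SALARY": 100}]}]}
too_many = {"DEPT": [{"DNAME": "SALES", "EMP": [{}, {}, {}]}]}
```

Nothing in the description refers to storage or physical layout,
which is the property the text singles out as making the language
simple to use.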

The functions necessary for CDTL were developed. These functions
are represented in terms of conversion statements between fields
of the source data description and the target data description.

A conversion statement consists of an association of one source
field (or more) to a target field, plus an algorithm for the
conversion of data (such as truncation, concatenation, or
data-type transformation). Types of conversion statements were
identified and form the basis of CDTL. In brief, these types are
as follows:

(1) Instance -- represents a mapping of instances of a field
of a repeating group (RG) into a field of a higher-level
RG.

(2) Bundle -- represents a combination of multiple values of
fields of the same RG level into an RG instance of a
lower level.

(3) Operation -- allows a set of values in an RG instance to
be combined by some operation (e.g., Average) into one
value of a field in a higher-level RG.

(4) Direct -- allows for an association of source and target
fields according to a given algorithm (e.g.,
truncation).

(5) Repeat -- necessary when a repetition of a field value
through values of a lower-level RG is required.

(6) Leveling -- can be used to create an upper-level target
RG from a source RG that has repeating values.

(7) Concatenation -- necessary when a target field is made
up by concatenating source fields (or portions of them).

(8) As-Is -- used when a portion of the source data is to be
moved unchanged to the target data.

(9) Inversion -- necessary when an alternative view of the
data base is required (e.g., a department-employee data
base needs to be reorganized as an employee-department
data base).
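
Four of the conversion types listed above can be illustrated with
small Python functions operating on a nested record. This is a
sketch of our reading of the one-line descriptions, not the CDTL
definition: the function signatures, the record layout, and the
field names are hypothetical.

```python
from statistics import mean

# One department instance with a repeating group (RG) of employees.
source = {
    "DNAME": "ENGINEERING",
    "EMP": [
        {"FIRST": "ADA", "LAST": "KING", "SALARY": 100},
        {"FIRST": "BOB", "LAST": "LEE", "SALARY": 200},
    ],
}


def direct(value, algorithm=lambda v: v):
    """Direct: one source field to one target field, via a given algorithm."""
    return algorithm(value)


def concatenation(*parts):
    """Concatenation: build a target field from several source fields."""
    return "".join(parts)


def operation(instances, field_name, op):
    """Operation: combine a field's values across RG instances into one value."""
    return op(inst[field_name] for inst in instances)


def repeat(value, instances):
    """Repeat: copy a higher-level value into every lower-level instance."""
    return [dict(inst, **value) for inst in instances]


target = {
    "DEPT": direct(source["DNAME"], algorithm=lambda v: v[:4]),   # truncation
    "AVG_SALARY": operation(source["EMP"], "SALARY", mean),       # e.g., Average
    "STAFF": [
        {"NAME": concatenation(e["FIRST"], " ", e["LAST"])} for e in source["EMP"]
    ],
}
flat = repeat({"DNAME": source["DNAME"]}, source["EMP"])
```

Each function relates source fields to a single target field,
which is the field-to-field view that the conversion statements
take.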


The identification of conversion types led to the problem of
determining which mapping types can meaningfully coexist. We
discovered properties that guide us in identifying those
combinations that are semantically and logically sound and those
that should be rejected. For example, if there is a DIRECT
between a field of a source RG and a field of the corresponding
target RG, then no other type between fields of those RGs is
possible. We defined the concept of "correspondence" between
source and target RGs; using it, we could talk about "up"
mappings and "down" mappings. Thus, combinations of "up" and
"down" mappings cannot coexist between fields of the same source
and target RGs. The outcome of this stage of semantic analysis
was the definition of the CDTL, which expresses conversion
mappings in terms of field-to-field mappings only. This greatly
simplifies specifying the required conversion mappings.
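
The two coexistence rules stated above lend themselves to a
mechanical check, sketched here in Python. Several points are
assumptions on our part: the report does not say which mapping
types count as "up" or "down", so the two sets below are guesses
based on the type descriptions; every source/target RG pair is
treated as "corresponding"; and the triple format for mappings is
hypothetical.

```python
from collections import defaultdict

UP = {"INSTANCE", "OPERATION"}   # assumed: these map into a higher-level RG
DOWN = {"REPEAT"}                # assumed: this maps into a lower-level RG


def check_coexistence(mappings):
    """Return a list of rule violations (empty if the mappings can coexist).

    mappings: iterable of (source_rg, target_rg, mapping_type) triples.
    """
    by_pair = defaultdict(set)
    for source_rg, target_rg, kind in mappings:
        by_pair[(source_rg, target_rg)].add(kind)

    problems = []
    for pair, kinds in sorted(by_pair.items()):
        others = sorted(kinds - {"DIRECT"})
        if "DIRECT" in kinds and others:
            # Rule 1: a DIRECT between two RGs excludes every other type.
            problems.append(f"{pair}: DIRECT cannot be mixed with {others}")
        if kinds & UP and kinds & DOWN:
            # Rule 2: "up" and "down" mappings cannot share an RG pair.
            problems.append(f"{pair}: 'up' and 'down' mappings cannot coexist")
    return problems


good = [("EMP", "STAFF", "DIRECT"), ("EMP", "DEPT", "OPERATION")]
bad = [
    ("EMP", "STAFF", "DIRECT"), ("EMP", "STAFF", "REPEAT"),
    ("DEPT", "UNIT", "INSTANCE"), ("DEPT", "UNIT", "REPEAT"),
]
```

Here check_coexistence(good) reports nothing, while
check_coexistence(bad) reports one violation for each of the two
RG pairs.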

A version of the CDTL was defined. For reasons explained above,
it is a fairly simple language, consisting mainly of
field-to-field mappings between source and target data
structures; this simplicity is a most important property for
users. Another feature that we included in the CDTL is the
ability to specify a string modification. This facility allows
field values to be modified and reconfigured in a way similar to
that of the Data Reconfiguration Service (Anderson, et al.,
1971). However, we did not find it necessary to include
facilities such as explicit conditionals.

The design of the standard data format was completed. Two
properties that we found important had to be compromised because
of conflict: (1) an efficient way of reading values in a data
record, using hierarchy levels and instance occurrence, and (2)
the need to leave a portion of the data virtually unchanged when
the AS-IS function is specified (i.e., a portion of the source
data to be converted must be left unchanged, so that the
converter can integrate this portion of the source data into the
target data being formed). We discovered an elegant way of
compromising these requirements, by defining a top-to-bottom,
left-to-right linearization of a hierarchy instance, together
with embedded relative displacements linking instances of the
different levels. A detailed description of this structure will
be presented in a future document.
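
Since the actual structure is deferred to a future document, the
following Python sketch shows only one possible layout that fits
the description: a top-to-bottom, left-to-right (preorder)
flattening in which each entry carries a relative displacement
past its own subtree. The entry layout [level, values, skip] and
both function names are our assumptions, not the project's
standard form.

```python
def linearize(instance, level=0, out=None):
    """Flatten a hierarchy instance in top-to-bottom, left-to-right order.

    Each entry is [level, values, skip], where skip is the relative
    displacement (in entries) to the first entry after this subtree.
    """
    if out is None:
        out = []
    entry = [level, instance["values"], 0]
    out.append(entry)
    start = len(out) - 1
    for child in instance.get("children", []):
        linearize(child, level + 1, out)
    entry[2] = len(out) - start   # jump over this instance and its descendants
    return out


def siblings_at(linear, index):
    """Yield indexes of consecutive same-level instances by following skips."""
    level = linear[index][0]
    while index < len(linear) and linear[index][0] == level:
        yield index
        index += linear[index][2]


dept = {
    "values": ["SALES"],
    "children": [
        {"values": ["ADA"],
         "children": [{"values": ["TYPING"]}, {"values": ["FILING"]}]},
        {"values": ["BOB"]},
    ],
}
linear = linearize(dept)
employees = list(siblings_at(linear, 1))
```

In such a layout a whole subtree occupies the contiguous slice
linear[i : i + skip], so it can be copied unchanged for an AS-IS
mapping, while the skips let a reader step between instances of
one level without scanning the levels below.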

The design of the translator was developed. This major task
consists of two parts. The "front end" performs a lexical and
semantic analysis of the CDDL and the CDTL statements, using the
mapping properties referred to above. It should detect any
logical or semantic inconsistencies and produce a "conversion
table" for the second part, the "data converter". The conversion
table contains step-by-step instruction entries that drive a
controller. The controller then invokes the appropriate routines
to perform the mapping functions. The data converter operates on
a source data instance in the standard form and generates a
target data instance in the standard form, according to the CDTL
specifications.
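
A minimal Python sketch of the front end's role follows. It is
not the translator's design: the statement format, the table
entry layout, and the particular consistency checks (unknown
fields, a target field assigned twice or never) are assumptions
chosen for illustration, and the mapping-coexistence checks are
omitted.

```python
def build_conversion_table(source_fields, target_fields, statements):
    """Check statements against both descriptions; return ordered table entries.

    statements: list of (mapping_type, [source field names], target field name).
    Raises ValueError on the first inconsistency found.
    """
    table, seen_targets = [], set()
    for step, (kind, sources, target) in enumerate(statements, start=1):
        missing = [s for s in sources if s not in source_fields]
        if missing:
            raise ValueError(f"statement {step}: unknown source field(s) {missing}")
        if target not in target_fields:
            raise ValueError(f"statement {step}: unknown target field {target!r}")
        if target in seen_targets:
            raise ValueError(f"statement {step}: target field {target!r} mapped twice")
        seen_targets.add(target)
        table.append(
            {"step": step, "op": kind, "sources": list(sources), "target": target}
        )
    unmapped = sorted(set(target_fields) - seen_targets)
    if unmapped:
        raise ValueError(f"target field(s) never assigned: {unmapped}")
    return table


table = build_conversion_table(
    {"FIRST", "LAST", "SALARY"},
    {"NAME", "PAY"},
    [("CONCATENATION", ["FIRST", "LAST"], "NAME"), ("DIRECT", ["SALARY"], "PAY")],
)

try:
    build_conversion_table({"A"}, {"X"}, [("DIRECT", ["B"], "X")])
    rejected = False
except ValueError:
    rejected = True
```

Because all checking happens before any data are read, the data
converter can follow the table entries without re-validating
them.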


Most of the modules of the translator have been designed and are
outlined in flowchart form. Using them, we plan to start
implementation of the translator in PL/I in October, 1974.


4.3  PLANS 

Our goal for the next contract year is to implement a prototype
of the data base translator, as well as to develop and design the
source and target data reformatters. The following tasks are
planned.


4.3.1 Implementation of the Data Translator

This  task  is  broken  into  two  subtasks: 

(a) Lexical and semantic analysis of the source and target
data base descriptions in CDDL and the translation
statements in CDTL. This step will produce internal
tables that represent the data translation operations
to be performed. The analyzer will operate in the
following manner: After the CDDL and CDTL statements
are read and checked for their syntax legality, the
semantic analyzer checks whether the translation
requests are semantically possible; the translation
tables are then produced.

(b) Translation of the source data (in their standard form)
to the target data (in standard form). In this step,
the translator uses the translation tables produced in
the previous step as follows: A controller reads table
entries in sequence and interprets the field-to-field
type mappings to be performed. It invokes an
appropriate module (one for each of the
mappings: DIRECT, INSTANCE, etc.), which in turn calls
a "read" module to extract the appropriate data from
the source records. After the desired value is
obtained, the controller invokes a "write" module to
generate the desired part of the target record. This
operation repeats until the complete target record is
generated. Records are generated until all of the
source data have been translated.

Two additional modules are planned. One module operates before
the translation process starts and is called the "subsetting"
module. This operation is required when we want to consider only
a subset of the data base for translation (e.g., only the female
employees). The other module is for post-translation ordering
and is used to order the target records after translation has
been completed.
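
The control flow planned above (subsetting, a table-driven
controller with "read" and "write" modules, then ordering) can be
sketched in Python as follows. The sketch is illustrative only:
flat dictionaries stand in for standard-form records, only two
mapping modules are shown, and all names and the table entry
layout are hypothetical.

```python
def read(record, names):
    """'Read' module: extract the named field values from a source record."""
    return [record[n] for n in names]


def write(target_record, name, value):
    """'Write' module: store one value in the target record being built."""
    target_record[name] = value


def direct(record, sources):
    """DIRECT mapping module: pass a single source value through."""
    return read(record, sources)[0]


def concatenation(record, sources):
    """CONCATENATION mapping module: join several source values."""
    return " ".join(read(record, sources))


MODULES = {"DIRECT": direct, "CONCATENATION": concatenation}


def translate(source_records, table, subset=None, order_key=None):
    if subset is not None:                      # pre-translation subsetting
        source_records = filter(subset, source_records)
    targets = []
    for record in source_records:
        target = {}
        for entry in table:                     # controller: entries in sequence
            value = MODULES[entry["op"]](record, entry["sources"])
            write(target, entry["target"], value)
        targets.append(target)
    if order_key is not None:                   # post-translation ordering
        targets.sort(key=order_key)
    return targets


table = [
    {"op": "CONCATENATION", "sources": ["FIRST", "LAST"], "target": "NAME"},
    {"op": "DIRECT", "sources": ["SALARY"], "target": "PAY"},
]
employees = [
    {"FIRST": "BOB", "LAST": "LEE", "SEX": "M", "SALARY": 200},
    {"FIRST": "EVE", "LAST": "ROY", "SEX": "F", "SALARY": 300},
    {"FIRST": "ADA", "LAST": "KING", "SEX": "F", "SALARY": 100},
]
result = translate(
    employees,
    table,
    subset=lambda r: r["SEX"] == "F",
    order_key=lambda t: t["NAME"],
)
```

Adding a mapping type in this arrangement means adding one entry
to MODULES; the controller loop itself does not change.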


4.3.2 Design of the Reformatters

This task will require a study of the common input and output
data formats of data management systems. We project that only a
few basic formats will be identified. These formats could be
used to define a Common Data Format Language (CDFL). The
implementation of data reformatters would then use a data format
specification in CDFL to perform the reformatting. Another
alternative is to build special-purpose reformatters for every
data format type, an alternative that seems to be the more
attractive choice when the number of format types is small.
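
The first alternative, a reformatter driven by a format
specification, can be illustrated with a small Python sketch.
CDFL has not been defined, so the "specification" here is simply
an assumed list of (field name, width) pairs describing a
fixed-width record; the function names are likewise hypothetical.

```python
def parse_fixed(line, layout):
    """Source reformatter: fixed-width line to a field dictionary."""
    record, pos = {}, 0
    for name, width in layout:
        record[name] = line[pos:pos + width].rstrip()
        pos += width
    return record


def format_fixed(record, layout):
    """Target reformatter: field dictionary to a fixed-width line."""
    return "".join(str(record[name]).ljust(width)[:width] for name, width in layout)


SOURCE_LAYOUT = [("NAME", 8), ("DEPT", 5)]
TARGET_LAYOUT = [("DEPT", 4), ("NAME", 10)]

line = "KING".ljust(8) + "SALES"
record = parse_fixed(line, SOURCE_LAYOUT)
out = format_fixed(record, TARGET_LAYOUT)
```

With this approach a new format of the same family needs only a
new layout, whereas the special-purpose alternative needs a new
pair of routines for each format type.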


4.3.3 Experimental Conversion of a Data Base

Toward the end of the contract year, we plan to experiment with
the data translator by taking a small but useful data base from
ARPA-DMS and converting it into a form acceptable to the
Datacomputer. Initial discussions with the people involved at
ARPA and CCA have been held.


APPENDIX


The text of this report was prepared on-line using the CMS/EDIT
facility provided under the IBM VM/370 system. This report is
available in machine-readable form.


